The IODS course introduces the basics of data science and touches upon themes of open science and the use of open data. During the course, participants learn basic tools for working with open data in R, as well as the principles of open science.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 5 15:57:19 2022"
Assignment 1
Exercise set 1 is a fairly good crash course in R. It introduces a lot of useful commands that are often needed to tidy data for further analysis. Most of the commands were familiar, but I did not know that pipes can be used when doing analyses. That will be helpful, e.g., for running analyses on a subset of the data. For an introduction-level book I prefer the data analysis with R book (the one with the parrot). Having read a couple of introductory-level books on R and data analysis, I think the differences are minor, and any of the ones I have read would suffice for a beginner.
Did the thing with the token; it took some time, but it works. Who would have thought that university laptops would require remote access to install Git.
Link to GitHub diary: https://esoini.github.io/IODS-project/ Repository: https://github.com/esoini/IODS-project
Describe the work you have done this week and summarize your learning.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)
date()
## [1] "Mon Dec 5 15:57:19 2022"
Reading the data
rm(list = ls())
#setwd("/Users/eetusoin/Desktop/IODS")
lrn14<-read.csv("./data/learning2014.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
fig <- ggpairs(lrn14, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
fig
lm0 <- lm(points ~ attitude + stra + surf, data = lrn14)  # avoid naming the object "lm", which shadows the lm() function
summary(lm0)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
lm1 <- lm(points ~ attitude + stra, data = lrn14)
summary(lm1)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
lm2 <- lm(points ~ attitude, data = lrn14)
summary(lm2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
##1 The data consist of responses to the ASSIST survey (Approaches and Study Skills Inventory for Students); the measures were originally reported in Finnish. The data have 166 observations and 7 variables: gender (male/female), age, attitude, mean scores on the deep, strategic (stra) and surface (surf) learning scales, and exam points.
##2 The distributions of attitude, deep, stra, surf and points look fairly normal. The distribution of age is positively skewed (longer tail on the right), as most of the students taking the survey have been young adults, and the plot shows some outliers. The strongest correlation is between attitude and points (r = .4), and the next strongest associations are with stra (r = .146) and surf (r = -.144).
##3 Of the three variables used to predict points (attitude, stra and surf, which had the strongest correlations with points), only attitude was associated with points (statistically significant at p < .001). The model was first refitted without the surf variable; as stra was still not statistically significant at the .05 alpha level, the final model had only attitude as a predictor.
lm() fits a regression line to the data using ordinary least squares, i.e. by minimizing the sum of squared residuals. The intercept reflects the predicted points when all the predictors are set to zero. The model parameters were tested using t-tests: the t-test checks whether a coefficient differs from 0, i.e. whether the predictor affects the slope of the regression line. In the final model, an increase of 1 point in attitude increases the predicted exam points by +3.5 points, with a standard error of 0.57.
Multiple R-squared refers to the proportion of the overall variation that the model explains; it reflects the goodness of fit of the model. When comparing the models above, removing surf and stra from the analysis led to a minimal attenuation of the adjusted R-squared, so those variables did not explain exam points well. In the last model R-squared was 0.19, so the model with only attitude explained approximately 19 percent of the variation in exam points.
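As a quick sanity check (a sketch using the coefficients printed above, not part of the original analysis), the final model's prediction can be computed by hand:

```r
# predicted exam points for a student with attitude = 3,
# using the intercept and slope of the attitude-only model above
11.6372 + 3.5255 * 3   # about 22.2 points

# the same via predict(), assuming lm2 is the fitted model from above
predict(lm2, newdata = data.frame(attitude = 3))
```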
par(mfrow = c(1, 3))
plot(lm2, which = c(1, 2, 5))
##4 The residuals vs. fitted values plot shows that most of the residuals are symmetrical around zero, indicating that the model fits the data reasonably well. There seems to be greater variability at higher fitted values, which may mean the variance is not constant, so a transformation could be used to account for that.
In the QQ-plot the observations fit the line, but there are some deviations at the low and high values. This indicates that the association may not be linear; a quadratic model could fit the data better.
The residuals vs. leverage plot shows that some observations are close to Cook's distance, but none exceed it. Hence there is no reason to remove any of the observations.
Overall, normality and constancy of variance seem fine, and there are no observations that would significantly impact the model. Thus we can assume that the assumptions of linear regression were met.
Of the variables in the data, I chose sex, age, quality of family relationships, and current health status. The rationale behind the chosen variables was that, as the sexes differ in their vulnerability to internalizing/externalizing problems, I hypothesize that male sex would be a risk factor for high use of alcohol. Adolescents of different ages may differ in the perceived peer pressure to drink alcohol or to blend in with the group; hence I hypothesize that older age (i.e. closer to twenty) may also be a risk factor. Lastly, poor family relationships and poor current health status may cause distress in an individual, and alcohol may be used as a coping mechanism to alleviate the distress.
alc1<-alc[c("age","sex","famrel", "health", "high_use")]
fig <- ggpairs(alc1, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
fig
table(alc1$sex,alc1$high_use)
##
## FALSE TRUE
## F 154 41
## M 105 70
g2<-ggplot(alc1, aes(x=high_use, y= famrel,col=sex))
g2+ geom_boxplot()
g3<-ggplot(alc1, aes(x=high_use, y=health,col=sex))
g3+ geom_boxplot()
cor(alc1$high_use, alc1$age)
## [1] 0.105543
Contrary to my hypothesis, health does not seem to be associated with high alcohol use, though most of the participants were in good health, which may mask some of the effect (i.e. a ceiling effect). High alcohol use correlated positively with age, which is in line with my hypothesis. Also in line with my hypothesis, men/boys seem to use alcohol more. Family relationships have a high mean and some outliers at the lower end, so the variable may not be well suited for use as a continuous variable in the analyses (though that is what I intend to do).
#Logistic regression
r1<-glm(high_use~age+sex+health+famrel, data= alc1, family="binomial")
summary(r1)
##
## Call:
## glm(formula = high_use ~ age + sex + health + famrel, family = "binomial",
## data = alc1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4646 -0.8332 -0.6688 1.1524 2.2500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.17523 1.75142 -2.384 0.01713 *
## age 0.22881 0.09991 2.290 0.02201 *
## sexM 0.95044 0.24127 3.939 8.17e-05 ***
## health 0.12605 0.08830 1.427 0.15344
## famrel -0.36626 0.12858 -2.849 0.00439 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 422.29 on 365 degrees of freedom
## AIC: 432.29
##
## Number of Fisher Scoring iterations: 4
Of the chosen variables, male sex was statistically significant at alpha level .001 and family relationships at alpha level .01, and age at alpha level .05. Current health status was not associated with high alcohol use. Male sex was associated with a considerable increase (log odds = 0.95) in high alcohol use, and age with a slight increase (log odds = 0.23). Good family relationships were associated with lower log odds of high alcohol use (log odds = -0.37).
# compute odds ratios (OR)
OR <- coef(r1) %>% exp
# compute confidence intervals (CI)
CI<-confint(r1)%>% exp
## Waiting for profiling to be done...
print(cbind(OR, CI))
## OR 2.5 % 97.5 %
## (Intercept) 0.01537168 0.0004744108 0.4655281
## age 1.25710707 1.0347423229 1.5328666
## sexM 2.58685790 1.6201078828 4.1787499
## health 1.13434220 0.9560769767 1.3527013
## famrel 0.69332056 0.5371929276 0.8909827
Male sex was associated with 2.6 times higher odds of high alcohol use compared to female sex, with a 95% confidence interval from 1.6 to 4.2. Family relationships had an odds ratio of 0.69, meaning that a one-unit increase on the family relationship scale decreased the odds of high alcohol use by about 30%. Lastly, a one-year increase in age increased the odds of high alcohol use by a factor of 1.26 (95% CI 1.03 to 1.53).
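The odds ratios above are simply the exponentiated log-odds coefficients of the logistic model; as a small check against the printed summary:

```r
# exponentiating the log-odds coefficients reported in the summary above
exp(0.95044)   # sexM:   about 2.59
exp(0.22881)   # age:    about 1.26
exp(-0.36626)  # famrel: about 0.69
```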
# predict() the probability of high_use
probabilities <- predict(r1, type = "response")
# add the predicted probabilities to 'alc'
alc1 <- mutate(alc1, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc1 <- mutate(alc1, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc1$high_use, prediction = alc1$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 246 13
## TRUE 91 20
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc1$high_use, prob = alc1$probability)
## [1] 0.2810811
The model is correct in (246 + 20) / 370 = 72% of the cases, compared to random guessing (50/50). The training error is 28%.
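The same accuracy can be computed directly from the confusion table (a sketch using the counts printed above):

```r
# confusion table counts from above: rows = observed, columns = predicted
tab <- matrix(c(246, 91, 13, 20), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(tab)) / sum(tab)  # (246 + 20) / 370, about 0.72
training_error <- 1 - accuracy         # about 0.28
```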
Cross validation
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc1, cost = loss_func, glmfit = r1, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3027027
The current model predicts wrongly in about 30% of the cases in cross-validation. The number is higher than on the training data because the model may be overfitted, exploiting random variation in the training data that does not generalize to other data. Cross-validation uses smaller subsets of the data to test generalizability and detect overfitting.
#Loading the data and exploring the dimensions and structure
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset is part of the MASS package and contains housing values in the suburbs of Boston. It has 506 observations and 14 variables, of which 12 are numeric and 2 are integers.
#Visualizing
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
cors <- cor(Boston) %>% round(2)
corrplot(cors,method = "pie",type= "upper")
summary(cors)
## crim zn indus chas
## Min. :-0.3900 Min. :-0.5700 Min. :-0.7100 Min. :-0.12000
## 1st Qu.:-0.2150 1st Qu.:-0.4050 1st Qu.:-0.3825 1st Qu.:-0.04750
## Median : 0.3200 Median :-0.2550 Median : 0.3950 Median : 0.02000
## Mean : 0.1786 Mean :-0.0550 Mean : 0.1929 Mean : 0.08143
## 3rd Qu.: 0.4500 3rd Qu.: 0.2775 3rd Qu.: 0.6300 3rd Qu.: 0.09000
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## nox rm age dis
## Min. :-0.770 Min. :-0.61000 Min. :-0.7500 Min. :-0.7700
## 1st Qu.:-0.360 1st Qu.:-0.29750 1st Qu.:-0.2625 1st Qu.:-0.5225
## Median : 0.305 Median :-0.21500 Median : 0.3050 Median :-0.3050
## Mean : 0.190 Mean :-0.01286 Mean : 0.1736 Mean :-0.1464
## 3rd Qu.: 0.655 3rd Qu.: 0.19000 3rd Qu.: 0.5775 3rd Qu.: 0.2400
## Max. : 1.000 Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## rad tax ptratio black
## Min. :-0.4900 Min. :-0.5300 Min. :-0.5100 Min. :-0.44000
## 1st Qu.:-0.2850 1st Qu.:-0.3050 1st Qu.:-0.2175 1st Qu.:-0.37750
## Median : 0.4600 Median : 0.4850 Median : 0.2250 Median :-0.22500
## Mean : 0.2371 Mean : 0.2364 Mean : 0.1157 Mean :-0.06071
## 3rd Qu.: 0.6075 3rd Qu.: 0.6475 3rd Qu.: 0.3775 3rd Qu.: 0.16750
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## lstat medv
## Min. :-0.7400 Min. :-0.74000
## 1st Qu.:-0.4000 1st Qu.:-0.46000
## Median : 0.4150 Median :-0.38000
## Mean : 0.1407 Mean :-0.06857
## 3rd Qu.: 0.5775 3rd Qu.: 0.31000
## Max. : 1.0000 Max. : 1.00000
#par(mfrow=c(4, 4))
colnames <- dimnames(Boston)[[2]]
for (i in seq_along(colnames)) {
hist(Boston[,i], main=colnames[i], probability=TRUE, col="pink", border="white")
}
The distributions of rm and lstat seem normal; age, dis and medv seem skewed. indus and tax seem to have outliers or otherwise high variability in the observed values, as do black, crim and chas (when comparing the histograms with the printed summary).
The strongest negative correlations involve distance to employment centres (dis) with industry (indus), nitrogen oxide concentration (nox), and age (units built before 1940), all stronger than -.7. Particularly interesting are the weak correlations of chas with all other variables. Of the positive correlations, the strongest are between tax and rad (tax rate and accessibility to radial highways), indus and nox, and age and nox.
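The strongest pairs can also be read off the correlation matrix programmatically (a sketch, assuming cors is the rounded matrix computed above):

```r
# flatten the correlation matrix and list the strongest variable pairs
cors_long <- as.data.frame(as.table(cors))
# keep each pair once and drop the diagonal
cors_long <- subset(cors_long, as.character(Var1) < as.character(Var2))
head(cors_long[order(-abs(cors_long$Freq)), ], 5)
```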
#scaling the data
library(dplyr)
bs <- as.data.frame(scale(Boston))
summary(bs)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
#crime as factor using quantiles
q<-quantile(bs$crim)
bs$crime <- cut(bs$crim, breaks = q, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
#removing original crim variable
bs<- dplyr::select(bs, -crim)
Standardizing the data centers each variable around 0 and scales it to have a standard deviation of 1. Note that this does not make the data normally distributed; it only shifts and rescales the values.
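scale() simply subtracts each column's mean and divides by its standard deviation; the same operation by hand (a sketch, assuming Boston is loaded as above):

```r
# manual column-wise standardization, equivalent to scale(Boston)
bs_manual <- as.data.frame(lapply(Boston, function(x) (x - mean(x)) / sd(x)))
# every column now has mean 0 and standard deviation 1
round(colMeans(bs_manual), 10)
sapply(bs_manual, sd)
```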
#Fitting lda
#copy-pasting the code for training set
n <- nrow(bs)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- bs[ind,]
# create test set
test <- bs[-ind,]
#fitting the lda
eka<-lda(crime~., data= train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
eka
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2450495 0.2500000 0.2599010
##
## Group means:
## zn indus chas nox rm age
## low 1.01649131 -0.8890656 -0.11325431 -0.8960085 0.4305001 -0.8858801
## med_low -0.05868537 -0.3004102 -0.03371693 -0.5771421 -0.1517818 -0.3792303
## med_high -0.37516530 0.2156823 0.27340760 0.4287381 0.1057845 0.4268801
## high -0.48724019 1.0170492 -0.04735191 1.0350010 -0.4172917 0.7778832
## dis rad tax ptratio black lstat
## low 0.8976768 -0.6848939 -0.7137761 -0.46376153 0.37010732 -0.76315404
## med_low 0.3980887 -0.5352451 -0.4275939 -0.05084611 0.31073482 -0.14708401
## med_high -0.3946093 -0.4099119 -0.2946704 -0.35493036 0.07504817 -0.01106332
## high -0.8393336 1.6388211 1.5145512 0.78158339 -0.78364011 0.85510611
## medv
## low 0.499665597
## med_low -0.006752044
## med_high 0.165448722
## high -0.687011971
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07689280 0.63829690 -1.04318843
## indus 0.03372808 -0.19385415 0.10829917
## chas -0.08635777 -0.05422482 0.07079801
## nox 0.33601703 -0.80731669 -1.08308824
## rm -0.08551092 -0.06429283 -0.19303210
## age 0.32848192 -0.35337810 -0.24026167
## dis -0.05004636 -0.21963886 0.25512409
## rad 3.22077030 0.81527157 -0.38121181
## tax -0.01118570 0.11949779 0.81393214
## ptratio 0.10345791 0.02818429 -0.15805173
## black -0.11374202 0.04377025 0.11810593
## lstat 0.20032152 -0.11493054 0.51529815
## medv 0.17444824 -0.35809244 0.02366070
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9485 0.0387 0.0128
classes <- as.numeric(train$crime)
plot(eka, dimen = 2,col=classes, pch=classes)
lda.arrows(eka, myscale = 2)
Predicting the correct classes
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
pred <- predict(eka, newdata = test)
table(correct=correct_classes, predicted=pred$class)
## predicted
## correct low med_low med_high high
## low 15 11 2 0
## med_low 3 16 8 0
## med_high 0 10 14 1
## high 0 0 0 22
The LDA predicts the “high” class well, with few to no misclassifications. There seems to be more ambiguity with the lower levels of the crime variable.
##Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
data("Boston")
dat1 <- scale(Boston)
dat <- as.data.frame(dat1)
#Distances
dist_eu<-dist(dat1)
#setting seed for reproducibility
set.seed(666)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dat, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(dat, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
Two seems to be the optimal number of clusters, as the total within-cluster sum of squares drops sharply there. Some variables seem to differentiate the groups better, for example tax and rad, but for most variables there seems to be a lot of overlap.
Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises).
set.seed(999)
km2<-kmeans(dat,centers = 4)
targ<-km2$cluster
#variable chas appeared to be constant across groups, thus it was removed
lda2<-lda(targ~crim+zn+age+dis+rad+tax+ptratio+black+lstat+medv+nox+indus+rm, data = dat)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(targ)
# plot the lda results
plot(lda2, dimen = 2,col= classes)
lda.arrows(lda2, myscale = 1)
pairs(dat[6:10],col= targ)
rad (accessibility to radial highways) seems to affect the model most.
Loading the data, and showing summaries and a graphical overview
library(dplyr)
library(GGally)
dat<-read.csv("./data/human2.csv")
row.names(dat)<-dat$X
dat<-dat[,2:9]
summary(dat)
## sedu_rat lab_rat edu liexp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni mat_mor abr repre_parl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(dat)
The data have 155 observations and 8 variables.
VARIABLES:
country = Country
edu = Expected years of schooling
liexp = Life expectancy at birth
gni = Gross National Income per capita
mat_mor = Maternal mortality ratio
abr = Adolescent birth rate
repre_parl = Percentage of female representatives in parliament
sedu_rat = Ratio of females with at least secondary education divided by males with at least secondary education
lab_rat = Proportion of females in the work force divided by the proportion of males in the work force
Most of the variables seem to be normally distributed. abr seems to be skewed to the right, as is mat_mor; gni is heavily skewed to the right. Life expectancy seems slightly skewed to the left, as its mean is below its median.
Principal component analysis (PCA) on the raw (non-standardized) human data, with a biplot
A<-prcomp(dat)
summary(A)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
s<-summary(A)
pca_pr <- round(1*s$importance[2, ], digits = 5)
pca_pr2<-round(pca_pr*100,digits = 2)
pca_lab<-paste0(names(pca_pr), " (", pca_pr2, "%)")
biplot(A, choices = 1:2, cex = c(0.8, 1), col = c("grey70", "deeppink2"), xlab = pca_lab[1], ylab = pca_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
PCA on standardized data
dat1<-scale(dat)
dat1<-as.data.frame(dat1)
B<-prcomp(dat1)
summary(B)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
s<-summary(B)
pca_pr <- round(1*s$importance[2, ], digits = 5)
pca_pr2<-round(pca_pr*100,digits = 2)
pca_lab<-paste0(names(pca_pr), " (", pca_pr2, "%)")
biplot(B, choices = 1:2, cex = c(0.8, 1), col = c("grey70", "deeppink2"), xlab = pca_lab[1], ylab = pca_lab[2])
Both the plot and the explained variance changed drastically. PCA seeks to explain variance; without standardizing, the first component explains practically all of the variance. After standardizing, it is obvious that other components also contribute, but only the first two components each explain more than 10% of the variance.
In the first plot only gni (which has by far the largest raw variance) seems to drive the variance, whereas in the second plot maternal mortality, adolescent birth rate, education, life expectancy and the secondary education ratio load on the first component, and the labour force ratio and representation of women in parliament load on the second component.
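The dominance of gni in the unstandardized PCA follows from its raw variance, which is orders of magnitude larger than the others'. This can be checked directly (a sketch, assuming dat holds the human data read in above):

```r
# PCA on unscaled data is driven by the raw variances of the variables
sapply(dat, var)                        # gni's variance dwarfs the rest
sapply(as.data.frame(scale(dat)), var)  # after scaling, all variances are 1
```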
Interpretations of the obtained results
It seems that the first component relates to the welfare of the state and the second component more to equality of the sexes. When looking at the countries in the plot, most of the Western states are located on the left, whereas less wealthy countries are located on the right. When looking at the second component, countries in which the status of women is not equal to that of men are located at the bottom of the plot. Interestingly, some of the African countries are at the top of the plot. It is worth noting that there may be some interactions between the components chosen by PCA: in wealthier countries it may be possible for one spouse not to work, whereas in poorer countries both sexes may have to work.
The tea data come from the FactoMineR package and were collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perception of the product is (12 questions). In addition, some personal details were asked (4 questions).
Load the tea dataset and convert its character variables to factors:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
library(dplyr)
library(tidyr)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
# keep the first 10 columns (tea-drinking habit variables);
# GGally, tidyr and ggplot2 are loaded here in case earlier chunks did not
library(GGally); library(tidyr); library(ggplot2)
tea1 <- tea[, 1:10]
ggpairs(tea1)
pivot_longer(tea1, cols = everything()) %>%
  ggplot(aes(value)) +
  facet_wrap("name", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
A brief graphical exploration of the data: bar plots of each of the ten selected tea-drinking habit variables.
library(FactoMineR)
mca <- MCA(tea1, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea1, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.179 0.129 0.115 0.107 0.104 0.090 0.079
## % of var. 17.939 12.871 11.481 10.654 10.406 8.985 7.917
## Cumulative % of var. 17.939 30.810 42.291 52.945 63.351 72.337 80.253
## Dim.8 Dim.9 Dim.10
## Variance 0.076 0.068 0.054
## % of var. 7.559 6.793 5.395
## Cumulative % of var. 87.812 94.605 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.543 0.549 0.474 | -0.462 0.552 0.342 | -0.007 0.000
## 2 | -0.543 0.549 0.474 | -0.462 0.552 0.342 | -0.007 0.000
## 3 | -0.068 0.009 0.002 | 0.542 0.761 0.141 | -0.139 0.056
## 4 | -1.037 1.998 0.558 | 0.264 0.180 0.036 | -0.018 0.001
## 5 | -0.235 0.103 0.061 | -0.011 0.000 0.000 | 0.124 0.044
## 6 | -1.037 1.998 0.558 | 0.264 0.180 0.036 | -0.018 0.001
## 7 | 0.024 0.001 0.001 | -0.404 0.422 0.374 | -0.185 0.099
## 8 | -0.193 0.069 0.054 | 0.058 0.009 0.005 | -0.382 0.423
## 9 | -0.258 0.124 0.117 | -0.624 1.007 0.680 | -0.130 0.049
## 10 | -0.048 0.004 0.002 | -0.153 0.061 0.020 | -0.176 0.090
## cos2
## 1 0.000 |
## 2 0.000 |
## 3 0.009 |
## 4 0.000 |
## 5 0.017 |
## 6 0.000 |
## 7 0.078 |
## 8 0.210 |
## 9 0.030 |
## 10 0.027 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | 0.209 1.168 0.040 3.471 | -0.770 22.090 0.547
## Not.breakfast | -0.193 1.078 0.040 -3.471 | 0.710 20.391 0.547
## Not.tea time | -0.680 11.253 0.358 -10.351 | 0.327 3.635 0.083
## tea time | 0.527 8.723 0.358 10.351 | -0.254 2.817 0.083
## evening | 0.445 3.787 0.103 5.562 | 0.634 10.728 0.210
## Not.evening | -0.233 1.980 0.103 -5.562 | -0.332 5.609 0.210
## lunch | 0.947 7.329 0.154 6.788 | 0.167 0.317 0.005
## Not.lunch | -0.163 1.260 0.154 -6.788 | -0.029 0.054 0.005
## dinner | -1.571 9.633 0.186 -7.454 | 1.044 5.925 0.082
## Not.dinner | 0.118 0.725 0.186 7.454 | -0.079 0.446 0.082
## v.test Dim.3 ctr cos2 v.test
## breakfast -12.786 | -0.031 0.041 0.001 -0.518 |
## Not.breakfast 12.786 | 0.029 0.038 0.001 0.518 |
## Not.tea time 4.983 | 0.235 2.102 0.043 3.579 |
## tea time -4.983 | -0.182 1.630 0.043 -3.579 |
## evening 7.929 | -0.599 10.742 0.188 -7.494 |
## Not.evening -7.929 | 0.313 5.616 0.188 7.494 |
## lunch 1.195 | -1.133 16.410 0.221 -8.125 |
## Not.lunch -1.195 | 0.195 2.820 0.221 8.125 |
## dinner 4.951 | -0.090 0.050 0.001 -0.428 |
## Not.dinner -4.951 | 0.007 0.004 0.001 0.428 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.040 0.547 0.001 |
## tea.time | 0.358 0.083 0.043 |
## evening | 0.103 0.210 0.188 |
## lunch | 0.154 0.005 0.221 |
## dinner | 0.186 0.082 0.001 |
## always | 0.089 0.096 0.414 |
## home | 0.008 0.114 0.005 |
## work | 0.215 0.006 0.251 |
## tearoom | 0.315 0.003 0.018 |
## friends | 0.325 0.141 0.008 |
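The eigenvalue table at the top of the summary can be sanity-checked by hand: each "% of var." is just that dimension's eigenvalue divided by the total inertia, and the cumulative column is the running sum. A minimal Python sketch using the rounded eigenvalues copied from the output above (so the percentages differ slightly from the exact ones FactoMineR reports):

```python
# Rounded eigenvalues copied from the summary(mca) output above
eigenvalues = [0.179, 0.129, 0.115, 0.107, 0.104, 0.090,
               0.079, 0.076, 0.068, 0.054]

total = sum(eigenvalues)               # total inertia (~1 for 10 binary variables)
pct = [100 * e / total for e in eigenvalues]
cum = [sum(pct[:i + 1]) for i in range(len(pct))]

print(round(pct[0], 1))   # first dimension explains roughly 17.9 % of variance
print(round(cum[1], 1))   # first two dimensions together roughly 30.8 %
```

This also shows why the first two dimensions plotted below capture only about 31 % of the total variation, so the biplot is a fairly lossy summary of the data.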
# visualize MCA: biplot of the variable categories (individuals hidden)
plot(mca, invisible = c("ind"), graph.type = "classic", habillage = "quali")
According to the plot, having tea at dinner and not having tea at home have the greatest impact on the first two dimensions. Variables close to the horizontal x-axis contribute little to dimension 2, and variables close to the vertical y-axis contribute little to dimension 1. Most of the "Not" categories lie close to the origin, while their counterparts are located further away, meaning the latter drive the two dimensions.
Dimension 2 seems to reflect the social aspect of drinking tea: factors related to the social side of tea drinking shift categories along the y-axis. Dimension 1 is harder to interpret, but one possibility is that it reflects a "tea as a hobby" aspect, with things like drinking tea in a tearoom and having a dedicated tea time loading on it.
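For intuition on where these dimensions and category coordinates come from: MCA is essentially correspondence analysis applied to the one-hot indicator matrix of the answers. A minimal NumPy sketch on hypothetical data (not the tea dataset; the six respondents and two binary questions below are made up purely for illustration):

```python
import numpy as np

# Hypothetical indicator matrix: 6 respondents, 2 binary questions
# (say breakfast yes/no and evening yes/no), one-hot coded into J = 4 columns.
Z = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
Q = 2                       # number of variables
n, J = Z.shape              # individuals, total categories

P = Z / Z.sum()             # correspondence matrix
r = P.sum(axis=1)           # row masses (each 1/n)
c = P.sum(axis=0)           # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
eig = sv ** 2               # principal inertias (the MCA "eigenvalues")

# Total inertia of an indicator matrix is J/Q - 1 (here 4/2 - 1 = 1)
print(eig.sum())
# Principal coordinates of the categories on the first dimension:
cat_coords = (Vt[0] / np.sqrt(c)) * sv[0]
```

The category coordinates computed this way are what the biplot above displays: categories chosen by an atypical combination of respondents end up far from the origin, which is why the rare "dinner" category stands out while the common "Not" categories cluster near the centre.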